This is problem set #2, in which we hope you will practice using the visualization package ggplot2, as well as hone your knowledge of the packages tidyr and dplyr.
Note that this example is adapted from the_grammar.R on http://had.co.nz/ggplot2 for Psych 254 purposes.
First install and load the package.
#install.packages("ggplot2")
library(ggplot2)
Now we’re going to use qplot. qplot is the easy interface, meant to replace plot. You can give it simple calls like qplot(x, y), or slightly more complex ones like qplot(x, y, col = grp, data = d).
We’re going to be using the diamonds dataset. This is a set of measurements of diamonds, along with their price and other attributes.
library(ggplot2) # the diamonds data isn't found unless ggplot2 is loaded in the same chunk that references it
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
qplot(diamonds$carat, diamonds$price)
Scatter plots are trivial to make and easy to add features to. Modify this plot so that it uses the data frame rather than variables in the global environment (this gets you away from retyping diamonds$ every time you reference a variable).
qplot(carat, price, data = diamonds)
Try adding clarity and cut, using shape and color as your visual variables.
qplot(carat, price, shape = cut, color = clarity, data = diamonds)
# using shape for cut b/c <= 6 cuts
One of the primary benefits of ggplot2 is the use of facets - also known as small multiples in the Tufte vocabulary. That last plot was probably hard to read. Facets could make it better. Try adding a facets = x ~ y argument: x ~ y means rows are faceted by x and columns by y.
qplot(carat, price, data = diamonds, facets = cut ~ clarity)
But facets can also get overwhelming. Try to strike a good balance between color, shape, and faceting.
HINT: facets = . ~ x puts x on the columns, while facets = ~ x (no dot) wraps the facets. Under the hood these call different functions: facet_wrap (one-sided formula) and facet_grid (two-sided formula).
qplot(carat, price, color = clarity, data = diamonds, facets = ~ cut)
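To make the hint concrete, here is a sketch of the same plots written with the underlying facet functions (equivalent to the qplot facets shorthand above):

```r
library(ggplot2)

# facets = cut ~ clarity in qplot corresponds to facet_grid():
p.grid <- ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  facet_grid(cut ~ clarity)

# facets = ~ cut corresponds to facet_wrap():
p.wrap <- ggplot(diamonds, aes(carat, price, colour = clarity)) +
  geom_point() +
  facet_wrap(~ cut)
```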
The basic unit of a ggplot plot is a “geom” - a mapping between data (via an “aesthetic”) and a particular geometric configuration on coordinate axes.
Let’s try some other geoms and manipulate their parameters. First, try a histogram (geom = "histogram").
qplot(price, data = diamonds, geom = 'histogram')
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
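The warning suggests choosing a binwidth explicitly rather than accepting the range/30 default. As a sketch (the value 500 is an arbitrary choice for dollar-denominated prices, not anything from the original):

```r
library(ggplot2)

# set binwidth explicitly to silence the default-binwidth warning
p <- qplot(price, data = diamonds, geom = "histogram", binwidth = 500)
```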
Now facet your histogram by clarity and cut.
qplot(price, data = diamonds, geom = 'histogram', facets = clarity ~ cut)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
I like a slightly cleaner look to my plots. Luckily, ggplot allows you to add “themes” to your plots. Try the same plot but add + theme_bw() or + theme_classic(). In my experience, different themes work better for different applications.
qplot(price, data = diamonds, geom = 'histogram', facets = clarity ~ cut) + theme_bw()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
qplot is actually a convenience wrapper around ggplot, the more systematic interface. ggplot is sometimes easier to use and sometimes a bit more complicated. What I want to show off here is the ability to build up complex plots with multiple elements. You can do much of this with qplot, but a few things are hard.
ggplot is the basic call, where you specify A) a dataframe and B) an aesthetic mapping from variables in the dataset to visual properties of the plot.
d <- ggplot(diamonds, aes(x=carat, y=price)) # first you set the aesthetic and dataset
d + geom_point() # then you add geoms
d + geom_point(aes(colour = carat)) # and you can keep doing this to add layers to the plot
Try writing this as a single set of additions (e.g. one line of R code, though you can put in linebreaks). This is the most common workflow for me.
ggplot(diamonds, aes(x = carat, y = price, colour = carat)) + geom_point()
You can also set the aesthetic separately for each geom, which makes some great plots possible, though it can get complicated. Try using ggplot to build a histogram of prices.
ggplot(diamonds, aes(x = price)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
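As an aside on setting aesthetics separately for each geom, a minimal sketch (the layering choices here are illustrative, not part of the assignment):

```r
library(ggplot2)

# global aes() applies to all layers; aes() inside a geom applies only there
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(colour = clarity), alpha = 0.3) +  # colour mapped for points only
  geom_smooth(method = "lm")                        # smoother ignores the colour mapping
```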
Sklar et al. (2012) claim evidence for unconscious arithmetic processing. We’re going to do a reanalysis of their Experiment 6, which provides the primary piece of evidence for that claim. The data are generously contributed by Asael Sklar.
First let’s set up a few preliminaries.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.1.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.1.2
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sem <- function(x) {sd(x) / sqrt(length(x))}
ci95 <- function(x) {sem(x) * 1.96}
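As a quick sanity check of these helpers (the numbers are made up for illustration):

```r
sem <- function(x) {sd(x) / sqrt(length(x))}
ci95 <- function(x) {sem(x) * 1.96}

x <- c(1, 2, 3, 4, 5)  # sample sd = sqrt(2.5), n = 5
sem(x)                 # sqrt(2.5 / 5) = sqrt(0.5) ≈ 0.707
ci95(x)                # 1.96 * sqrt(0.5) ≈ 1.386
```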
First read in two data files and subject info. A and B refer to different trial order counterbalances.
wsubinfo <- read.csv("../data/sklar_expt6_subinfo_corrected.csv")
d.a <- read.csv("../data/sklar_expt6a_corrected.csv")
d.b <- read.csv("../data/sklar_expt6b_corrected.csv")
Gather these datasets into long form and get rid of the Xs in the headers.
d.a.long <- gather(d.a, subid, rt, X1:X21)
d.a.long$subid <- sub('X', '', d.a.long$subid)
d.b.long <- gather(d.b, subid, rt, X22:X42)
d.b.long$subid <- sub('X', '', d.b.long$subid)
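To see what the gather-then-strip-X step is doing, a toy version with made-up column names and values (not the real data):

```r
library(tidyr)

toy <- data.frame(prime = c("a", "b"), X1 = c(100, 110), X2 = c(120, 130))
toy.long <- gather(toy, subid, rt, X1:X2)   # one row per (prime, subject) pair
toy.long$subid <- sub("X", "", toy.long$subid)  # "X1" -> "1", etc.
```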
Bind these together. Check out bind_rows.
d.all.long <- bind_rows(d.a.long, d.b.long)
Merge these with subject info. You will need to look into merge and its relatives, left_join and right_join. Call this dataframe d, by convention.
d.all.long$subid <- as.numeric(d.all.long$subid) # match data types across data frames
d <- left_join(d.all.long, wsubinfo)
## Joining by: "subid"
stopifnot(nrow(d) == nrow(d.all.long)) # should preserve length
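A toy illustration of why left_join preserves the rows of its first argument (values are made up): each trial row keeps its place, and the subject-level info is duplicated across that subject's trials.

```r
library(dplyr)

trials <- data.frame(subid = c(1, 1, 2), rt = c(500, 520, 480))
info <- data.frame(subid = c(1, 2), subjective.test = c(0, 1))
joined <- left_join(trials, info, by = "subid")
# joined has one row per trial, with subjective.test filled in per subject
</imports>
```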
Clean up the factor structure.
d$presentation.time <- factor(d$presentation.time)
levels(d$operand) <- c("addition","subtraction") # relabels by position, so this assumes the existing levels sort with addition first
Examine the basic properties of the dataset. First, take a histogram.
ggplot(d, aes(x = rt)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Challenge question: what is the sample rate of the input device they are using to gather RTs?
# plot RTs
ggplot(d, aes(x = rt)) + geom_histogram(binwidth = 1)
# zoom in
ggplot(d, aes(x = rt)) + geom_histogram(binwidth = 1) + scale_x_continuous(limits = c(500, 600)) # zoom in
# get numbers
table(d$rt[d$rt > 500 & d$rt < 600])
##
## 523 524 525 559 560 561 595 596 597
## 96 210 93 134 299 129 195 477 161
It looks like the sampling interval is 36 ms (a rate of roughly 28 Hz), with some rounding errors of +/- 1 ms.
Sklar et al. did two manipulation checks: subjective (asking participants whether they saw the primes) and objective (asking them to report the parity of the primes, even or odd, to find out whether they could actually read the primes when they tried). Examine both the unconscious and conscious manipulation checks (this information is stored in subinfo). What do you see? Are they related to one another?
wsubinfo <- mutate(wsubinfo, inclusion.status = ifelse(objective.test < .6 & subjective.test == 0, 'kept', 'dropped'))
ggplot(wsubinfo, aes(x = objective.test, y = subjective.test)) + geom_point(aes(color = inclusion.status), alpha = .5, size = 5)
summary(glm(subjective.test ~ objective.test, wsubinfo, family = "binomial"))
##
## Call:
## glm(formula = subjective.test ~ objective.test, family = "binomial",
## data = wsubinfo)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.37183 -0.82622 -0.07408 0.65056 1.96951
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.881 1.922 -3.060 0.00221 **
## objective.test 9.363 3.135 2.987 0.00282 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 58.224 on 41 degrees of freedom
## Residual deviance: 41.577 on 40 degrees of freedom
## AIC: 45.577
##
## Number of Fisher Scoring iterations: 5
There is a large dynamic range of objective test scores, and also a large number of participants who reported being able to see the numbers. As such, many participants were dropped for failing to meet the inclusion criteria. There also appears to be a positive relationship between the objective and subjective tests, which the logistic regression confirms.
OK, let’s turn back to the measure and implement Sklar et al.’s exclusion criterion. You need to have said you couldn’t see (subjective test) and also be not significantly above chance on the objective test (< .6 correct). Call your new data frame ds.
ds <- d[d$objective.test < .6 & d$subjective.test == 0, ]
Sklar et al. show a plot of a “facilitation effect” - the time to respond to incongruent primes minus the time to respond to congruent primes. They then plot this difference score for the subtraction condition at the two presentation times they tested. Try to reproduce this analysis.
HINT: first take averages within subjects, then compute your error bars across participants, using the sem function (defined above).
# subject data
d.subj <- ds %>%
  group_by(subid, operand, presentation.time, congruent, objective.test) %>%
  summarise(rt = mean(rt, na.rm = TRUE)) %>%
  spread(congruent, rt) %>%
  mutate(facilitation = no - yes)
# summarize data
d.summary <- d.subj %>%
  group_by(operand, presentation.time) %>%
  summarise(facilitation.mean = mean(facilitation), facilitation.sem = sem(facilitation))
Now plot this summary, giving more or less the bar plot that Sklar et al. gave (though I would keep operation as a variable here). Make sure you get some error bars on there (e.g. geom_errorbar or geom_linerange).
(ggplot(subset(d.summary, operand == 'subtraction'), aes(x = presentation.time, y = facilitation.mean))
+ geom_bar(stat = "identity")
+ geom_errorbar(aes(ymin = facilitation.mean - facilitation.sem, ymax = facilitation.mean + facilitation.sem), width = .5))
# check 2000 ms condition
t.test(d.subj[d.subj$operand == 'subtraction' & d.subj$presentation.time == 2000, ]$facilitation)
##
## One Sample t-test
##
## data: d.subj[d.subj$operand == "subtraction" & d.subj$presentation.time == 2000, ]$facilitation
## t = 2.243, df = 8, p-value = 0.05516
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.2800138 20.2312480
## sample estimates:
## mean of x
## 9.975617
What do you see here? How close is it to what Sklar et al. report? Do the error bars match? How do you interpret these data?
The means in this plot match the plot by Sklar et al. and suggest that participants respond more quickly following congruent relative to incongruent primes (at least in the 1700 ms condition; p = .055 in the 2000 ms condition, but the main effect of congruency is significant). The error bars do not match the plots by Sklar et al.: they are twice as large as those in the paper.
Challenge problem: verify Sklar et al.’s claim about the relationship between RT and the objective manipulation check.
ggplot(d.subj, aes(x = objective.test, y = facilitation, color = operand)) + geom_point(size = 3) + geom_smooth(method = 'lm')
summary(lm(facilitation ~ objective.test, data = d.subj, subset = operand == "subtraction"))
##
## Call:
## lm(formula = facilitation ~ objective.test, data = d.subj, subset = operand ==
## "subtraction")
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.484 -8.642 -1.552 8.354 37.115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -28.65 29.98 -0.955 0.355
## objective.test 86.85 58.97 1.473 0.161
##
## Residual standard error: 15.54 on 15 degrees of freedom
## Multiple R-squared: 0.1263, Adjusted R-squared: 0.06809
## F-statistic: 2.169 on 1 and 15 DF, p-value: 0.1615
Sklar et al. report that in Experiment 7 there was a negative relationship between scores on the objective test and facilitation scores (I believe this analysis was restricted to the subtraction condition). In other words, any conscious awareness of the primes was actually hurting rather than driving their effect. This relationship does not replicate in Experiment 6: there is no significant relationship between the objective test and facilitation scores (in the subtraction condition), and numerically the relationship is positive rather than negative.
Show us what you would do with these data, operating from first principles. What’s the fairest plot showing a test of Sklar et al.’s original hypothesis that people can do arithmetic “non-consciously”?
The fairest plots would either be (a) collapsing across presentation time and operand, or (b) reporting the data for all cells of the presentation time x operand design. Both are implemented below.
# summary for main effects
d.summary.main <- ds %>%
  group_by(subid, congruent) %>%
  summarise(rt = mean(rt, na.rm = TRUE)) %>%
  spread(congruent, rt) %>%
  mutate(facilitation = no - yes) %>%
  summarise(facilitation.mean = mean(facilitation), facilitation.sem = sem(facilitation))
(ggplot(d.summary.main, aes(x = 0, y = facilitation.mean))
+ geom_bar(stat = "identity")
+ geom_errorbar(aes(ymin = facilitation.mean - facilitation.sem, ymax = facilitation.mean + facilitation.sem), width = .5)
+ scale_x_continuous(limits = c(-1, 1), breaks = c(), name = ""))
# show all four cells
(ggplot(d.summary, aes(x = presentation.time, y = facilitation.mean))
+ geom_bar(stat = "identity")
+ geom_errorbar(aes(ymin = facilitation.mean - facilitation.sem, ymax = facilitation.mean + facilitation.sem), width = .5)
+ facet_grid(. ~ operand))
## Warning: Stacking not well defined when ymin != 0
Challenge problem: Do you find any statistical support for Sklar et al.’s findings?
# t-test
d.subj.clps <- ds %>%
  group_by(subid, congruent) %>%
  summarise(rt = mean(rt, na.rm = TRUE)) %>%
  spread(congruent, rt) %>%
  mutate(facilitation = no - yes)
t.test(d.subj.clps$facilitation)
##
## One Sample t-test
##
## data: d.subj.clps$facilitation
## t = 1.6693, df = 16, p-value = 0.1145
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -1.377207 11.581397
## sample estimates:
## mean of x
## 5.102095
Computing a facilitation score for each subject that collapses across operands and then collapsing these facilitation scores across presentation time does not yield a significant priming effect, t(16) = 1.67, p = .11.